2007-Oct-07
-----------

So far we've been very antagonistic towards any "clever" requirements
in a runtime: if the semantic model doesn't *spell out* the
optimization strategy, chances are no runtime will ever manage to be
smart enough to "discover" one.

This position has failed to address one important issue, and I need to
nail it down before going on: how do you efficiently schedule and
dispatch message-receives. If we can't do that efficiently, we're
sunk. Efficient here means: constant space per proc, constant
dispatching time.

So let's sketch the situation. Note this is not necessarily related to
handling dynamic alts and binding-channels-to-sets; it might be nice
if it handles it, but not essential. This is for the case of a fixed
set of ports and a fixed set of dispatching alts visible in the
proc. The only variable size is the number of queued callers and the
number

So, naively -- worst case -- we have 1 run queue including the
sleepers, we pull people out in random order, and each sleeper points
to the target it's trying to enter, and the first one to enter a proc
wins. That works, but it's effectively busy-waiting the sleepers when 
anyone is making progress; you only idle when *everyone* is blocked.

The crucial problem is devising the constant-size data structure that
maps N callers into queues in a variable partition of between 0 .. K
queues, such that you can do the following in O(1):

  - add a member to, or remove a member from, a queue
  - figure out if there's anyone in a queue
  - pick a representative from that queue at random

First, note the following facts:

  - doubling/halving vectors waste no more than 75% of their space.
  - 2-space copying collectors waste no more than 50% of their space

So we build a 2+10N-word structure given N processes:

  - 2 4N-word vectors representing 2 spaces for d/h queue sub-vectors,
    densely packed, copied back and forth between the spaces as they
    need reallocation.
  - 2 control words representing "end of current space" in each space.
  - a 2N-word vector of 2-word control records that refer into
    the 2 spaces

So a million processes require <=10 million words, or ~40mb, of
scheduling queues on a 32-bit machine. Not great, but not bad. Bounded
at least. And really, no process is going to eat less than 10 words of
stack and control state anyways; we just don't want to wind up
charging more for the queues than we do for the process abstraction in
a neutral-ish, low-stack form (say an auto proc with small-ish ports).
The proc can get by with 1 word in its header for every queue, and
combine all the auto ports into a single queue. So for abundant
million-copy auto-proc sort of procs, 1 word period: pointing to the
queue ID in the entry scheduler. Yay.

This gives us O(1) randomized scheduling on every queue assuming we
use an O(1) PRNG (I like ISAAC). Then when you execute an alt, it
picks 2 words from the PRNG, uses the first to pick a queue and the
second to pick a caller within the queue to wake (or should it pick 1
word mod total waiters, then distribute to the queue that held that
waiter? slightly slower to dispatch if you have a large number of
ports, but gives a different model of fairness...). Either way:
simple, fast, easy to guess what will happen, smoothes out load lumps
at every step of the system.

That last question is curious. Suppose you have a proc with a million
processes waiting on port A, and 1 process waiting on port B.

Should the B process get a ~1/1million chance to run, or a ~1/2 chance
to run? Which is easier to imagine the designer wanting more often?
Which is easier for the designer to work around if they mean the other
thing? I think the designer is more likely to want to get a guarantee
of the latter (1/2 chance) because they have more to count on that
way; they can always combine event streams together dynamically if
they want the other sort of behavior. It's easy enough to just have a
proc hold an internal work buffer that it mixes its own trouble into
and processes as it wishes. Also, it doesn't hurt that the 1/2 case
scales better :)

Also note that if you have dynamic ports, you may have a larger-size
scheduling pool.

Woo! I think this issue is (finally) solved. I chewed on it for 24+
hours solid. Bleah.

2007-Oct-10
-----------

Multithreading.

Hmm. I am generally not interested in multithreading: it's not worth
the pain. There are reasons you might want it, but they're few.

But let's just jot down "the simplest possible MT solution" and see if 
it works. First, note that modern systems guarantee word-aligned writes
are atomic and all have an atomic compare-and-swap (CAS) instruction.
So you are never going to have to race over refcount operations or 
twiddling owner IDs or whatever. So:

  - Every directly limited type has a single owner, which implies a 
    single reference. That much is obvious.

  - If I have a single reference and limitation infects, then the path
    from the thread register that owns me to me is all
    single-referenced values.

  - So: directly or indirectly limited slots make up a true
    tree. Always. There is always a true tree of process ownership,
    for example. Moving process references around only happens between
    portions of a tree-shaped data structure. If a thread can see node
    N in that tree, it is the only thread that can see the subtree
    under N.

  - The only stuff that might be "visible" to multiple threads
    simultaneously is the daggy "value" stuff. That's all refcounted
    along pointer edges. And we CoW any "value" into our own thread's
    tree any time we're performing an update: we don't expect other
    threads can "see" changes we make. We always edit "our own copy".
    There is no semantic "shared visibility mutation" in this
    language, never has been.

  - So all we need to reason about in MT terms is the CoW system and
    the message queue system, and interactions with the outside world.
    Which, unfortunately, we *always* have to reason about quite
    carefully anyways.

  - CoW is CAS-based. The path from your thread registers/stack to a
    heap memory write is *always* an lval-lookup LHS of an assignment
    statement. So to perform such a write you go step by step ensuring
    refcount=1. If you see a >1 edge you must pause, make a shallow
    copy, rewrite "your" rc=1 base value that's holding the shared
    edge to instead hold your copy, and carry on. When you get to the
    leaf you write.

    When you are forming a new reference to a shared value, you must
    increment the shared value refcount safely. You do this with a
    CAS.

    When you are dropping a refcount due to a value you're freeing,
    you must get the refcount to 0, then free. If you CAS the value
    1->0 successfully, you're the guy who gets to free the memory. If
    you're unsuccessful, someone else beat you to it (or re-referenced
    the value) and you're off the hook.

    aliases permit concurrent reads but reading through them should
    always use a memory barrier pointer-read sort of thing to ensure a
    coherent view. Or maybe it's that writes to heap pages always need
    to run a memory barrier op on the containing object? I can never
    remember...

  - The message queue system is a little hairier, but not hugely. You
    have a single Big Scheduler Of Processes and a bunch -- say 64 or
    512 -- of threads. You can activate only as many procs as you have
    threads at any moment, and you may occasionally have to have one
    wait on another due to a message send or an attempted proc
    lifecycle event; in those cases the proc becomes blocked, the
    thread sticks it back in the run queue, and it's "dormant data"
    for other threads to diddle.  You only possibly get stuck when you
    have an imperative like "I'm dying" and you are trying to clean up
    a proc you own. You can post the proc-death event but it might
    take a moment to "really" die. So you have a sort of grim reaper
    pseudo-proc that you have to message synchronously to confirm the
    death of another proc.
    
    ... OR ... you never have 1 thread running a proc with another
    thread running a parent proc. In order to activate a proc you have
    to block -- all the way up to the root of the proc tree -- its
    parents. So they're only likely to give you a little moment to run
    in, not a lot. That is probably a better rule. Explicit priority
    delegation: no inversion possible! The thread scheduler has to
    perform the topological allocation.

  - The other hairy part is that the message-send queue, the proc run
    queue, and the idle thread queue probably all have to be some
    fancy MT-safe lockless queues. Because the threads are all going
    to be independently and concurrently hammering on those
    things. But such structures exist.

---

Native pointers are limited, non-transmittable. Native functions are statically imported and
statically bound, single-entry, CDECL. Crate defn can specify transitive closure of acceptable
native dependencies, or * if you don't care.
